We focus on the Web Data Commons Training Dataset and Gold Standard for Large-Scale Product Matching (WDC for short), prepared at the University of Mannheim (Primpeli et al., 2019).
Additionally, each offer is linked to a specific product (cluster id) and contains textual attributes such as title, description, etc.

The training datasets are available in different sizes, varying from small to extra large. In every dataset, the ratio between positive and negative pairs is 1:3.

The proportions between the different collections within one category are as follows:
import pandas as pd
from eda_source import *
The data is provided as four separate files, one per category: Computers, Cameras, Watches, and Shoes. This division is convenient because we plan to train a separate model for each category, so no additional merging is required.
# Training data
df_computers = pd.read_json(path_or_buf='./data/computers_train_large.json', lines=True)
df_cameras = pd.read_json(path_or_buf='./data/cameras_train_large.json', lines=True)
df_shoes = pd.read_json(path_or_buf='./data/shoes_train_large.json', lines=True)
df_watches = pd.read_json(path_or_buf='./data/watches_train_large.json', lines=True)
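The `lines=True` argument tells pandas that each file is in JSON Lines format, that is, one JSON object (here, one offer pair) per line. As a rough illustration of that format only, the sketch below parses two made-up records with the standard library. The record contents and the `read_json_lines` helper are hypothetical and are not part of the dataset or of `eda_source`.

```python
import io
import json


def read_json_lines(handle):
    """Parse a JSON Lines stream: one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# Two made-up pair records that reuse a few of the column names seen below.
sample = io.StringIO(
    '{"pair_id": "1#2", "title_left": "asus prime x299", '
    '"title_right": "asus x299 deluxe", "label": 1}\n'
    '{"pair_id": "1#3", "title_left": "asus prime x299", '
    '"title_right": "hp x5560", "label": 0}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["pair_id"])
```

Each parsed line becomes one dictionary, which corresponds to one row of the DataFrames loaded above.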
Each observation is a pair of such offers and a label indicating whether these two offers are for the same product (a positive pair) or not (a negative pair). Even in the case of a negative pair, both offers belong to the same category (but different clusters/products).
print(f"In all categories there are the same features: \n {df_computers.columns.values}")
In all categories there are the same features: ['id_left' 'title_left' 'description_left' 'brand_left' 'price_left' 'specTableContent_left' 'keyValuePairs_left' 'category_left' 'cluster_id_left' 'identifiers_left' 'id_right' 'title_right' 'description_right' 'brand_right' 'price_right' 'specTableContent_right' 'keyValuePairs_right' 'category_right' 'cluster_id_right' 'identifiers_right' 'label' 'pair_id']
Feature meaning:

- id: Unique integer identifier of an offer.
- cluster_id: The integer ID of the cluster (product) to which an offer belongs.
- identifiers: A list of all identifier values that were assigned to an offer, together with the schema.org terms that were used to annotate the values.
- category: One of 25 product categories the product was assigned to; NaN if not part of the English subset.
- title: The product title.
- description: The product description.
- brand: The product brand.
- price: The product price.
- specTableContent: The specification table content of the product's website as one string.
- keyValuePairs: The key-value pairs that were extracted from the specification tables (the extraction method is described in the original dataset documentation).

Note: the `_left` suffix marks the first offer in a pair and the `_right` suffix marks the second one (pair_id has the form id_left#id_right).

- A positive pair: cluster_id_left == cluster_id_right
- A negative pair: cluster_id_left != cluster_id_right
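The rule for positive and negative pairs can be checked against the `label` column. The following is a minimal sketch of such a sanity check on plain dictionaries, assuming that `label == 1` denotes a positive pair; the helper name `label_matches_clusters` and the sample records are made up for illustration.

```python
def label_matches_clusters(record):
    """Return True when the label agrees with the cluster-id comparison."""
    same_cluster = record["cluster_id_left"] == record["cluster_id_right"]
    return bool(record["label"]) == same_cluster


# Hypothetical records; the third one is deliberately inconsistent.
pairs = [
    {"cluster_id_left": 109916, "cluster_id_right": 109916, "label": 1},
    {"cluster_id_left": 109916, "cluster_id_right": 1679624, "label": 0},
    {"cluster_id_left": 109916, "cluster_id_right": 1679624, "label": 1},
]
inconsistent = [p for p in pairs if not label_matches_clusters(p)]
print(len(inconsistent))
```

Running a check of this kind on the real data would show whether `label` and the cluster ids can be used interchangeably.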
df_computers.head()
| | id_left | title_left | description_left | brand_left | price_left | specTableContent_left | keyValuePairs_left | category_left | cluster_id_left | identifiers_left | ... | description_right | brand_right | price_right | specTableContent_right | keyValuePairs_right | category_right | cluster_id_right | identifiers_right | label | pair_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16876009 | 495906 b21 hp x5560 2 80ghz ml350 g6 , null ne... | description intel xeon x5560 ml350 g6 2 80ghz ... | hp enterprise | None | specifications category proliant processor sub... | {'category': 'proliant processor', 'sub catego... | Computers_and_Accessories | 1679624 | [{'/sku': '[495906b21]'}, {'/mpn': '[495906b21... | ... | description intel xeon x5560 ml350 g6 2 80ghz ... | hp enterprise | usd 213 85 | specifications category proliant processor sub... | {'category': 'proliant processor', 'sub catego... | Computers_and_Accessories | 1679624 | [{'/mpn': '[495906b21]'}] | 1 | 16876009#16248399 |
| 1 | 16876009 | 495906 b21 hp x5560 2 80ghz ml350 g6 , null ne... | description intel xeon x5560 ml350 g6 2 80ghz ... | hp enterprise | None | specifications category proliant processor sub... | {'category': 'proliant processor', 'sub catego... | Computers_and_Accessories | 1679624 | [{'/sku': '[495906b21]'}, {'/mpn': '[495906b21... | ... | None | None | None | categorie processors merk hp product hp intel ... | {'categorie': 'processors', 'merk': 'hp', 'pro... | Computers_and_Accessories | 1679624 | [{'/mpn': '[495906b21]'}, {'/gtin13': '[884420... | 1 | 16876009#5490217 |
| 2 | 16721450 | asus prime x299 deluxe prijzen tweakers | None | None | None | categorie moederborden merk asus product asus ... | {'categorie': 'moederborden', 'merk': 'asus', ... | Computers_and_Accessories | 109916 | [{'/mpn': '[primex299deluxe, 90mb0ty0m0eay0]'... | ... | support for x series intel core processors sli... | asus | None | None | None | Computers_and_Accessories | 109916 | [{'/mpn': '[90mb0ty0m0eay0]'}, {'/gtin13': '[4... | 1 | 16721450#10358026 |
| 3 | 14031864 | asus prime x299 deluxe | placa base atx socket lga2066 chipset intel x2... | None | None | None | None | Computers_and_Accessories | 109916 | [{'/productID': '[90mb0ty0m0eay0]'}] | ... | atx quad channel ddr4 3 x pcie 3 0 x16 2 x m 2... | asus | None | None | None | Computers_and_Accessories | 109916 | [{'/productID': '[asux29del]'}, {'/mpn': '[pri... | 1 | 14031864#4588573 |
| 4 | 14031864 | asus prime x299 deluxe | placa base atx socket lga2066 chipset intel x2... | None | None | None | None | Computers_and_Accessories | 109916 | [{'/productID': '[90mb0ty0m0eay0]'}] | ... | None | None | None | categorie moederborden merk asus product asus ... | {'categorie': 'moederborden', 'merk': 'asus', ... | Computers_and_Accessories | 109916 | [{'/mpn': '[primex299deluxe, 90mb0ty0m0eay0]'... | 1 | 14031864#16721450 |
5 rows × 22 columns
n = df_computers.shape[0]
m = df_computers.shape[1]
print(f'Number of pairs: {n}, number of features: {m}')
Number of pairs: 33359, number of features: 22
print(f'Number of positive pairs: {get_positive_pairs_count(df_computers)} - {round(get_positive_pairs_count(df_computers) * 100 / n, 2)}% of all observations.')
print(f'Number of negative pairs: {get_negative_pairs_count(df_computers)} - {round(get_negative_pairs_count(df_computers) * 100 / n, 2)}% of all observations.')
Number of positive pairs: 27213 - 81.58% of all observations.
Number of negative pairs: 6146 - 18.42% of all observations.
plot_positive_vs_negative(df_computers)
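The counting helpers used above come from `eda_source`, which is not shown here. As a hedged sketch of the underlying arithmetic only, the function below computes counts and percentages from a list of 0/1 labels, assuming that 1 marks a positive pair; `class_balance` is a hypothetical name, and the real helpers may define positives differently (for example via the cluster ids).

```python
def class_balance(labels):
    """Return (positives, negatives, positive %, negative %) for 0/1 labels."""
    n = len(labels)
    positives = sum(1 for y in labels if y == 1)
    negatives = n - positives
    return (
        positives,
        negatives,
        round(positives * 100 / n, 2),
        round(negatives * 100 / n, 2),
    )


# Made-up labels: three positive pairs and one negative pair.
labels = [1, 1, 1, 0]
print(class_balance(labels))
```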
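We plan to concatenate the textual features of each offer into a single string that serves as the model input. Missing values are therefore manageable: an absent attribute simply contributes nothing to the concatenated text.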
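The planned concatenation could look roughly like the sketch below. The choice and order of features in `TEXT_FEATURES`, the single-space separator, and the helper name `offer_text` are all assumptions made for illustration, not the final preprocessing. Missing values are assumed to be `None`, as in the table preview above.

```python
# Hypothetical choice and order of the features to concatenate.
TEXT_FEATURES = ("title", "brand", "description", "specTableContent")


def offer_text(record, side):
    """Join the non-missing text features of one offer ('left' or 'right')."""
    parts = []
    for feature in TEXT_FEATURES:
        value = record.get(f"{feature}_{side}")
        if value:  # skips None and empty strings
            parts.append(str(value))
    return " ".join(parts)


# Made-up pair with several missing attributes.
pair = {
    "title_left": "asus prime x299 deluxe",
    "brand_left": None,
    "description_left": "placa base atx",
    "specTableContent_left": None,
    "title_right": "asus prime x299 deluxe prijzen",
    "brand_right": "asus",
    "description_right": None,
    "specTableContent_right": None,
}
print(offer_text(pair, "left"))
print(offer_text(pair, "right"))
```

Because the title is never missing in these datasets, every offer yields a non-empty string under this scheme.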
missing_values = {}
col = 'title_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: title_left = 0, which is 0.0% of all pairs>
col = 'title_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: title_right = 0, which is 0.0% of all pairs>
col = 'description_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: description_left = 10279, which is 30.81% of all pairs>
col = 'description_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: description_right = 9984, which is 29.93% of all pairs>
col = 'brand_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: brand_left = 16878, which is 50.6% of all pairs>
col = 'brand_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: brand_right = 17081, which is 51.2% of all pairs>
col = 'price_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: price_left = 27252, which is 81.69% of all pairs>
col = 'price_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: price_right = 27510, which is 82.47% of all pairs>
col = 'specTableContent_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: specTableContent_left = 22610, which is 67.78% of all pairs>
col = 'specTableContent_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: specTableContent_right = 22170, which is 66.46% of all pairs>
plot_missing_values(missing_values, n)
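The per-column figures above come from `print_missing_values` in `eda_source`, whose implementation is not shown. A minimal sketch of an equivalent computation over plain dictionaries, assuming that a missing value is represented as `None` (or an absent key), might look as follows; `count_missing` is a hypothetical name.

```python
def count_missing(records, column):
    """Return (number of missing values, percentage of all records)."""
    n = len(records)
    missing = sum(1 for r in records if r.get(column) is None)
    return missing, round(missing * 100 / n, 2)


# Made-up records; an absent key is treated like an explicit None.
records = [
    {"title_left": "a", "price_left": None},
    {"title_left": "b", "price_left": "usd 213 85"},
    {"title_left": "c", "price_left": None},
    {"title_left": "d"},
]
print(count_missing(records, "price_left"))
print(count_missing(records, "title_left"))
```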
Number of pairs in which a feature is missing for both the left and the right offer simultaneously, per feature.
missing_values_simul = {}
col_prefix = 'title'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both title_right and title_left): = 0, which is 0.0% of all pairs>
col_prefix = 'description'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both description_right and description_left): = 3905, which is 11.71% of all pairs>
col_prefix = 'brand'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both brand_right and brand_left): = 10962, which is 32.86% of all pairs>
col_prefix = 'price'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both price_right and price_left): = 23179, which is 69.48% of all pairs>
col_prefix = 'specTableContent'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the column: (both specTableContent_right and specTableContent_left): = 17410, which is 52.19% of all pairs>
plot_missing_values_simultaneously(missing_values_simul, n)
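The simultaneous counts differ from the per-column counts because a pair is counted only when the feature is absent on both sides. The sketch below illustrates that condition on made-up records, again assuming `None` marks a missing value; `count_missing_both` is a hypothetical stand-in for the `eda_source` helper, not its actual code.

```python
def count_missing_both(records, prefix):
    """Count pairs where both <prefix>_left and <prefix>_right are missing."""
    left, right = f"{prefix}_left", f"{prefix}_right"
    n = len(records)
    missing = sum(
        1 for r in records if r.get(left) is None and r.get(right) is None
    )
    return missing, round(missing * 100 / n, 2)


# Made-up pairs: only the first one lacks the brand on both sides.
pairs = [
    {"brand_left": None, "brand_right": None},
    {"brand_left": "asus", "brand_right": None},
    {"brand_left": None, "brand_right": "hp"},
    {"brand_left": "hp", "brand_right": "hp"},
]
print(count_missing_both(pairs, "brand"))
```

Pairs that are not counted here still carry the feature for at least one of the two offers.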
avg_lens = {}
avg_lens_pos = {}
avg_lens_neg = {}
col = 'title_right'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_right for - all pairs = 70.0691867262208, pos. pairs = 69.8634843640907, neg. pairs = 70.97998698340383
col = 'title_left'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_left for - all pairs = 69.39866302946731, pos. pairs = 69.45926579208466, neg. pairs = 69.13032866905304
col = 'description_left'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_left for - all pairs = 239.65781348361762, pos. pairs = 236.40939991915627, neg. pairs = 254.04100227790434
col = 'description_right'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_right for - all pairs = 224.62957522707515, pos. pairs = 222.36438466909198, neg. pairs = 234.65929059550928
plot_avg_lengths(avg_lens, avg_lens_pos, avg_lens_neg, n)
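The averages above are produced by `print_avg_lengths` from `eda_source`. Its exact definition is not shown, so the following is only a sketch under two assumptions: length is measured in characters, and missing values are skipped rather than counted as length zero. The name `average_length` is hypothetical.

```python
def average_length(records, column, label=None):
    """Mean character length of a column, optionally for one label only."""
    lengths = [
        len(r[column])
        for r in records
        if r.get(column) is not None and (label is None or r["label"] == label)
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Made-up records; the missing title is ignored in every average.
records = [
    {"title_left": "abcd", "label": 1},
    {"title_left": "ab", "label": 0},
    {"title_left": None, "label": 1},
]
print(average_length(records, "title_left"))           # all pairs
print(average_length(records, "title_left", label=1))  # positive pairs
print(average_length(records, "title_left", label=0))  # negative pairs
```

If the real helper counts words instead of characters, or treats missing values as empty strings, the numbers would differ accordingly.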
df_cameras.head()
| | id_left | title_left | description_left | brand_left | price_left | specTableContent_left | keyValuePairs_left | category_left | cluster_id_left | identifiers_left | ... | description_right | brand_right | price_right | specTableContent_right | keyValuePairs_right | category_right | cluster_id_right | identifiers_right | label | pair_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13664621 | canon ef extender 2x iii digital slr camera ca... | for purchase inquires please call 416 977 9711... | None | None | None | None | Camera_and_Photo | 495045 | [{'/sku': '[4410b002]'}, {'/gtin13': '[1380312... | ... | None | None | None | None | None | Camera_and_Photo | 495045 | [{'/gtin14': '[13803122152]'}] | 1 | 13664621#13797854 |
| 1 | 17100990 | canon ef 2x iii extender accessories onestop d... | p the canon extender ef 2x iii is designed for... | None | None | None | None | Camera_and_Photo | 495045 | [{'/gtin14': '[13803122152]'}] | ... | extender ef 2x iii | canon | None | None | None | Camera_and_Photo | 495045 | [{'/productID': '[4410b002]'}] | 1 | 17100990#11150432 |
| 2 | 17100990 | canon ef 2x iii extender accessories onestop d... | p the canon extender ef 2x iii is designed for... | None | None | None | None | Camera_and_Photo | 495045 | [{'/gtin14': '[13803122152]'}] | ... | multiplies select telephoto lens s focal lengt... | None | None | None | None | Camera_and_Photo | 495045 | [{'/mpn': '[4410b002]'}] | 1 | 17100990#12939297 |
| 3 | 12939297 | canon ef 2x iii extender multiplies select tel... | multiplies select telephoto lens s focal lengt... | None | None | None | None | Camera_and_Photo | 495045 | [{'/mpn': '[4410b002]'}] | ... | extender ef 2x iii | canon | None | None | None | Camera_and_Photo | 495045 | [{'/productID': '[4410b002]'}] | 1 | 12939297#11150432 |
| 4 | 13664621 | canon ef extender 2x iii digital slr camera ca... | for purchase inquires please call 416 977 9711... | None | None | None | None | Camera_and_Photo | 495045 | [{'/sku': '[4410b002]'}, {'/gtin13': '[1380312... | ... | extender ef 2x iii | canon | None | None | None | Camera_and_Photo | 495045 | [{'/productID': '[4410b002]'}] | 1 | 13664621#11150432 |
5 rows × 22 columns
n = df_cameras.shape[0]
m = df_cameras.shape[1]
print(f'Number of pairs: {n}, number of features: {m}')
Number of pairs: 20036, number of features: 22
print(f'Number of positive pairs: {get_positive_pairs_count(df_cameras)} - {round(get_positive_pairs_count(df_cameras) * 100 / n, 2)}% of all observations.')
print(f'Number of negative pairs: {get_negative_pairs_count(df_cameras)} - {round(get_negative_pairs_count(df_cameras) * 100 / n, 2)}% of all observations.')
Number of positive pairs: 16193 - 80.82% of all observations.
Number of negative pairs: 3843 - 19.18% of all observations.
plot_positive_vs_negative(df_cameras)
We plan to concatenate the features into one text for the input of models. Therefore, missing values are manageable.
missing_values = {}
col = 'title_left'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: title_left = 0, which is 0.0% of all pairs>
col = 'title_right'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: title_right = 0, which is 0.0% of all pairs>
col = 'description_left'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: description_left = 4528, which is 22.6% of all pairs>
col = 'description_right'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: description_right = 4969, which is 24.8% of all pairs>
col = 'brand_left'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: brand_left = 11216, which is 55.98% of all pairs>
col = 'brand_right'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: brand_right = 11592, which is 57.86% of all pairs>
col = 'price_left'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: price_left = 18008, which is 89.88% of all pairs>
col = 'price_right'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: price_right = 18314, which is 91.41% of all pairs>
col = 'specTableContent_left'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: specTableContent_left = 17159, which is 85.64% of all pairs>
col = 'specTableContent_right'
print_missing_values(df_cameras, col, missing_values)
Number of missing values in the column: specTableContent_right = 16965, which is 84.67% of all pairs>
plot_missing_values(missing_values, n)
Number of pairs in which a feature is missing for both the left and the right offer simultaneously, per feature.
missing_values_simul = {}
col_prefix = 'title'
print_missing_values_simultaneously(df_cameras, col_prefix, missing_values_simul)
Number of missing values in the column: (both title_right and title_left): = 0, which is 0.0% of all pairs>
col_prefix = 'description'
print_missing_values_simultaneously(df_cameras, col_prefix, missing_values_simul)
Number of missing values in the column: (both description_right and description_left): = 1547, which is 7.72% of all pairs>
col_prefix = 'brand'
print_missing_values_simultaneously(df_cameras, col_prefix, missing_values_simul)
Number of missing values in the column: (both brand_right and brand_left): = 7840, which is 39.13% of all pairs>
col_prefix = 'price'
print_missing_values_simultaneously(df_cameras, col_prefix, missing_values_simul)
Number of missing values in the column: (both price_right and price_left): = 16693, which is 83.32% of all pairs>
col_prefix = 'specTableContent'
print_missing_values_simultaneously(df_cameras, col_prefix, missing_values_simul)
Number of missing values in the column: (both specTableContent_right and specTableContent_left): = 14776, which is 73.75% of all pairs>
plot_missing_values_simultaneously(missing_values_simul, n)
avg_lens = {}
avg_lens_pos = {}
avg_lens_neg = {}
col = 'title_right'
print_avg_lengths(df_cameras, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_right for - all pairs = 71.8657915751647, pos. pairs = 71.86333600938677, neg. pairs = 71.87613843351548
col = 'title_left'
print_avg_lengths(df_cameras, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_left for - all pairs = 72.59253343980835, pos. pairs = 72.49052059531897, neg. pairs = 73.02237835024721
col = 'description_left'
print_avg_lengths(df_cameras, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_left for - all pairs = 539.3376422439609, pos. pairs = 529.4947199407151, neg. pairs = 580.8121259432735
col = 'description_right'
print_avg_lengths(df_cameras, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_right for - all pairs = 448.36953483729286, pos. pairs = 428.8500586673254, neg. pairs = 530.6174863387978
plot_avg_lengths(avg_lens, avg_lens_pos, avg_lens_neg, n)
df_shoes.head()
| | id_left | title_left | description_left | brand_left | price_left | specTableContent_left | keyValuePairs_left | category_left | cluster_id_left | identifiers_left | ... | description_right | brand_right | price_right | specTableContent_right | keyValuePairs_right | category_right | cluster_id_right | identifiers_right | label | pair_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8305739 | interactive sneaker hogan women shoes brunaros... | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | ... | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | 1 | 8305739#2447385 |
| 1 | 11265724 | hogan interactive lav h spezzata ricamo hxw00n... | sneakers in nabuk con allacciatura frontale et... | hogan | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | ... | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | 1 | 11265724#6638421 |
| 2 | 11265724 | hogan interactive lav h spezzata ricamo hxw00n... | sneakers in nabuk con allacciatura frontale et... | hogan | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | ... | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | 1 | 11265724#2447385 |
| 3 | 6638421 | hogan sneakers interactive aus wildleder in na... | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | ... | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | 1 | 6638421#8305739 |
| 4 | 6987609 | hogan interactive sneaker in dark blue suede | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | ... | None | None | None | None | None | Shoes | 13980798 | [{'/sku': '[hxw00n0s3609keu810]'}] | 1 | 6987609#6638421 |
5 rows × 22 columns
n = df_shoes.shape[0]
m = df_shoes.shape[1]
print(f'Number of pairs: {n}, number of features: {m}')
Number of pairs: 22989, number of features: 22
print(f'Number of positive pairs: {get_positive_pairs_count(df_shoes)} - {round(get_positive_pairs_count(df_shoes) * 100 / n, 2)}% of all observations.')
print(f'Number of negative pairs: {get_negative_pairs_count(df_shoes)} - {round(get_negative_pairs_count(df_shoes) * 100 / n, 2)}% of all observations.')
Number of positive pairs: 19507 - 84.85% of all observations.
Number of negative pairs: 3482 - 15.15% of all observations.
plot_positive_vs_negative(df_shoes)
We plan to concatenate the features into one text for the input of models. Therefore, missing values are manageable.
missing_values = {}
col = 'title_left'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: title_left = 0, which is 0.0% of all pairs>
col = 'title_right'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: title_right = 0, which is 0.0% of all pairs>
col = 'description_left'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: description_left = 8247, which is 35.87% of all pairs>
col = 'description_right'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: description_right = 8108, which is 35.27% of all pairs>
col = 'brand_left'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: brand_left = 19182, which is 83.44% of all pairs>
col = 'brand_right'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: brand_right = 19822, which is 86.22% of all pairs>
col = 'price_left'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: price_left = 22308, which is 97.04% of all pairs>
col = 'price_right'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: price_right = 22394, which is 97.41% of all pairs>
col = 'specTableContent_left'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: specTableContent_left = 22269, which is 96.87% of all pairs>
col = 'specTableContent_right'
print_missing_values(df_shoes, col, missing_values)
Number of missing values in the column: specTableContent_right = 22324, which is 97.11% of all pairs>
plot_missing_values(missing_values, n)
Number of pairs in which a feature is missing for both the left and the right offer simultaneously, per feature.
missing_values_simul = {}
col_prefix = 'title'
print_missing_values_simultaneously(df_shoes, col_prefix, missing_values_simul)
Number of missing values in the column: (both title_right and title_left): = 0, which is 0.0% of all pairs>
col_prefix = 'description'
print_missing_values_simultaneously(df_shoes, col_prefix, missing_values_simul)
Number of missing values in the column: (both description_right and description_left): = 5201, which is 22.62% of all pairs>
col_prefix = 'brand'
print_missing_values_simultaneously(df_shoes, col_prefix, missing_values_simul)
Number of missing values in the column: (both brand_right and brand_left): = 17292, which is 75.22% of all pairs>
col_prefix = 'price'
print_missing_values_simultaneously(df_shoes, col_prefix, missing_values_simul)
Number of missing values in the column: (both price_right and price_left): = 21856, which is 95.07% of all pairs>
col_prefix = 'specTableContent'
print_missing_values_simultaneously(df_shoes, col_prefix, missing_values_simul)
Number of missing values in the column: (both specTableContent_right and specTableContent_left): = 21678, which is 94.3% of all pairs>
plot_missing_values_simultaneously(missing_values_simul, n)
avg_lens = {}
avg_lens_pos = {}
avg_lens_neg = {}
col = 'title_right'
print_avg_lengths(df_shoes, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_right for - all pairs = 67.89238331375876, pos. pairs = 68.13938586148562, neg. pairs = 66.50861573808156
col = 'title_left'
print_avg_lengths(df_shoes, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_left for - all pairs = 68.16507895080255, pos. pairs = 68.32726713487466, neg. pairs = 67.25646180356117
col = 'description_left'
print_avg_lengths(df_shoes, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_left for - all pairs = 432.4493453390752, pos. pairs = 430.9258727636233, neg. pairs = 440.9842044801838
col = 'description_right'
print_avg_lengths(df_shoes, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_right for - all pairs = 428.0137892035321, pos. pairs = 431.344184139027, neg. pairs = 409.35611717403793
plot_avg_lengths(avg_lens, avg_lens_pos, avg_lens_neg, n)
df_watches.head()
| | id_left | title_left | description_left | brand_left | price_left | specTableContent_left | keyValuePairs_left | category_left | cluster_id_left | identifiers_left | ... | description_right | brand_right | price_right | specTableContent_right | keyValuePairs_right | category_right | cluster_id_right | identifiers_right | label | pair_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10216791 | samsung gear s2 silver smart watch with band 4... | introducing a smartwatch that s designed for m... | None | 259 99 | None | None | Cellphones_and_Accessories | 7598741 | [{'/gtin8': '[43211511]'}, {'/mpn': '[smr7200z... | ... | with elegant curves and premium finishes the s... | samsung | None | None | None | Cellphones_and_Accessories | 7598741 | [{'/mpn': '[smr7200zwaxar]'}] | 1 | 10216791#16270232 |
| 1 | 15092873 | samsung gear s2 smartwatch silver accessories ... | with elegant curves and premium finishes the s... | samsung | None | None | None | Cellphones_and_Accessories | 7598741 | [{'/mpn': '[smr7200zwaxar]'}] | ... | with elegant curves and premium finishes the s... | samsung | None | None | None | Cellphones_and_Accessories | 7598741 | [{'/mpn': '[smr7200zwaxar]'}] | 1 | 15092873#16270232 |
| 2 | 8496283 | cambridge oro ros reloj para hombre 40mm danie... | None | daniel wellington | None | None | None | Jewelry | 340023 | [{'/sku': '[1615dw00100003]'}, {'/mpn': '[dw00... | ... | artikel nr 1615 dw00100003 elegant moderne her... | daniel wellington | None | None | None | Jewelry | 340023 | [{'/sku': '[1615dw00100003]'}, {'/mpn': '[dw00... | 1 | 8496283#12803629 |
| 3 | 2688544 | daniel wellington classic man cambridge rose g... | None | None | None | meer informatie prijs 159 00 ean 7350068240034... | None | Jewelry | 340023 | [{'/sku': '[dw00100003]'}] | ... | this slim line men s daniel wellington cambrid... | None | None | None | None | Jewelry | 340023 | [{'/productID': '[dw00100003]'}] | 1 | 2688544#16348009 |
| 4 | 12803629 | cambridge rosegold herrenuhr 40mm daniel welli... | artikel nr 1615 dw00100003 elegant moderne her... | daniel wellington | None | None | None | Jewelry | 340023 | [{'/sku': '[1615dw00100003]'}, {'/mpn': '[dw00... | ... | None | None | None | meer informatie prijs 159 00 ean 7350068240034... | None | Jewelry | 340023 | [{'/sku': '[dw00100003]'}] | 1 | 12803629#11680054 |
5 rows × 22 columns
n = df_watches.shape[0]
m = df_watches.shape[1]
print(f'Number of pairs: {n}, number of features: {m}')
Number of pairs: 27027, number of features: 22
print(f'Number of positive pairs: {get_positive_pairs_count(df_watches)} - {round(get_positive_pairs_count(df_watches) * 100 / n, 2)}% of all observations.')
print(f'Number of negative pairs: {get_negative_pairs_count(df_watches)} - {round(get_negative_pairs_count(df_watches) * 100 / n, 2)}% of all observations.')
Number of positive pairs: 21864 - 80.9% of all observations.
Number of negative pairs: 5163 - 19.1% of all observations.
plot_positive_vs_negative(df_watches)
We plan to concatenate the features into one text for the input of models. Therefore, missing values are manageable.
missing_values = {}
col = 'title_left'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: title_left = 0, which is 0.0% of all pairs>
col = 'title_right'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: title_right = 0, which is 0.0% of all pairs>
col = 'description_left'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: description_left = 10102, which is 37.38% of all pairs>
col = 'description_right'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: description_right = 10244, which is 37.9% of all pairs>
col = 'brand_left'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: brand_left = 20115, which is 74.43% of all pairs>
col = 'brand_right'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: brand_right = 20292, which is 75.08% of all pairs>
col = 'price_left'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: price_left = 26353, which is 97.51% of all pairs>
col = 'price_right'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: price_right = 26411, which is 97.72% of all pairs>
col = 'specTableContent_left'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: specTableContent_left = 22595, which is 83.6% of all pairs>
col = 'specTableContent_right'
print_missing_values(df_watches, col, missing_values)
Number of missing values in the column: specTableContent_right = 22342, which is 82.67% of all pairs>
plot_missing_values(missing_values, n)
Number of pairs in which a feature is missing for both the left and the right offer simultaneously, per feature.
missing_values_simul = {}
col_prefix = 'title'
print_missing_values_simultaneously(df_watches, col_prefix, missing_values_simul)
Number of missing values in the column: (both title_right and title_left): = 0, which is 0.0% of all pairs>
col_prefix = 'description'
print_missing_values_simultaneously(df_watches, col_prefix, missing_values_simul)
Number of missing values in the column: (both description_right and description_left): = 5571, which is 20.61% of all pairs>
col_prefix = 'brand'
print_missing_values_simultaneously(df_watches, col_prefix, missing_values_simul)
Number of missing values in the column: (both brand_right and brand_left): = 16773, which is 62.06% of all pairs>
col_prefix = 'price'
print_missing_values_simultaneously(df_watches, col_prefix, missing_values_simul)
Number of missing values in the column: (both price_right and price_left): = 25816, which is 95.52% of all pairs>
col_prefix = 'specTableContent'
print_missing_values_simultaneously(df_watches, col_prefix, missing_values_simul)
Number of missing values in the column: (both specTableContent_right and specTableContent_left): = 19410, which is 71.82% of all pairs>
plot_missing_values_simultaneously(missing_values_simul, n)
avg_lens = {}
avg_lens_pos = {}
avg_lens_neg = {}
col = 'title_right'
print_avg_lengths(df_watches, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_right for - all pairs = 69.44255744255744, pos. pairs = 68.68715697036224, neg. pairs = 72.64148750726322
col = 'title_left'
print_avg_lengths(df_watches, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: title_left for - all pairs = 69.36844636844637, pos. pairs = 69.24903951701427, neg. pairs = 69.87410420298276
col = 'description_left'
print_avg_lengths(df_watches, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_left for - all pairs = 576.9012839012839, pos. pairs = 557.580223197951, neg. pairs = 658.7210923881464
col = 'description_right'
print_avg_lengths(df_watches, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average number of the column: description_right for - all pairs = 572.1730491730492, pos. pairs = 585.2401664837175, neg. pairs = 516.8371102072439
plot_avg_lengths(avg_lens, avg_lens_pos, avg_lens_neg, n)
import plotly
## Preserve plotly functionality in a .html file
plotly.offline.init_notebook_mode()